Metric Indexes for Approximate String Matching in a Dictionary

Author

Kimmo Fredriksson

Abstract

We consider the problem of finding all approximate occurrences of a given string q, with at most k differences, in a finite database or dictionary of strings. The strings can be, e.g., natural language words, such as the vocabulary of some document or set of documents. This has many important applications in both offline (indexed) and online string matching.

More precisely, we have a universe U of strings and a non-negative distance function d : U × U → N. The distance function is a metric if it satisfies (i) d(x, y) = 0 ⇔ x = y; (ii) d(x, y) = d(y, x); (iii) d(x, y) ≤ d(x, z) + d(z, y). The last property is called the “triangular inequality” and is the most important one in our case. Many useful distance functions are known to be metric; in particular, edit (Levenshtein) distance is a metric, and we use it for d. Our dictionary S is a finite subset of that universe, i.e. S ⊆ U. S is preprocessed so that range queries can be answered efficiently: given a query string q, we retrieve all strings in S that are close enough to q, i.e. the set {u ∈ S | d(q, u) ≤ k} for some k.

To solve the problem, we build a metric index over the dictionary and use the triangular inequality to prune the search. This is not a new idea: a huge number of different indexes have been proposed over the years; see [2] for a recent survey. An example of such an index is the Burkhard-Keller tree [1], which builds a hierarchy as follows. Some arbitrary string p ∈ S (called the pivot) is chosen for the root of the tree. Child number e is built recursively from the set S_e = {u ∈ S \ {p} | d(p, u) = e}. This is repeated until only one string, or in general b strings (a bucket), are left; these are stored in the leaves of the tree. The tree has O(n) nodes, where n = |S|, and the construction requires O(n log n) distance computations on average. A search with query string q and range k first evaluates the distance d(q, p), where p is the string in the root of the tree.
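As a concrete reference for the distance function d, the following is a minimal sketch of edit (Levenshtein) distance using the standard two-row dynamic program, followed by a spot check of the three metric properties on a handful of words. The function name `edit_distance` and the sample words are illustrative assumptions, not taken from the paper, and the check on a small sample is of course not a proof.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning x into y."""
    prev = list(range(len(y) + 1))          # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                          # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Spot-check properties (i)-(iii) on a small, hypothetical sample.
sample = ["book", "back", "cook", "books", "boo"]
for x in sample:
    for y in sample:
        assert (edit_distance(x, y) == 0) == (x == y)          # (i)
        assert edit_distance(x, y) == edit_distance(y, x)      # (ii)
        for z in sample:                                       # (iii)
            assert edit_distance(x, y) <= edit_distance(x, z) + edit_distance(z, y)
```

Each call costs O(|x|·|y|) time, which is why the indexes discussed here are measured by the number of distance computations they perform.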
If d(q, p) ≤ k, then p is put into the output list. The search then recursively enters each child e such that d(q, p) − k ≤ e ≤ d(q, p) + k. Whenever the search reaches a leaf, the strings in the stored bucket are compared directly against q. The search requires O(n^α) distance computations on average, where 0 < α < 1.

Another example is the Approximating Eliminating Search Algorithm (AESA) [4], which is an extreme case of pivot-based algorithms. This time there is no hierarchy; the data structure is simply a precomputed matrix of all the n(n−1)/2 distances between the n strings in S. The space complexity is therefore O(n^2), and the matrix is computed with O(n^2) edit distance computations. This makes the structure highly impractical for large n. The benefit comes from search...
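The Burkhard-Keller tree construction and range search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Node`, `build`, and `range_search` are hypothetical, the pivot is simply the first string of each subset, and the input strings are assumed distinct. The pruning rule follows from the triangular inequality: any u with d(p, u) = e and d(q, u) ≤ k must satisfy |d(q, p) − e| ≤ k, so all other children can be skipped.

```python
def edit_distance(x, y):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


class Node:
    """Internal nodes hold a pivot and children keyed by distance e;
    leaves hold a bucket of at most b strings and no pivot."""
    def __init__(self, pivot=None, children=None, bucket=None):
        self.pivot = pivot
        self.children = children or {}
        self.bucket = bucket or []


def build(strings, b=1):
    """Recursively build the tree over a list of distinct strings."""
    if len(strings) <= b:
        return Node(bucket=list(strings))
    pivot, rest = strings[0], strings[1:]    # arbitrary pivot choice
    groups = {}                              # e -> S_e
    for u in rest:
        groups.setdefault(edit_distance(pivot, u), []).append(u)
    return Node(pivot=pivot,
                children={e: build(g, b) for e, g in groups.items()})


def range_search(node, q, k):
    """Return all stored strings u with edit_distance(q, u) <= k."""
    if node.pivot is None:                   # leaf: compare bucket directly
        return [u for u in node.bucket if edit_distance(q, u) <= k]
    d = edit_distance(q, node.pivot)
    out = [node.pivot] if d <= k else []
    for e, child in node.children.items():
        if d - k <= e <= d + k:              # triangular-inequality pruning
            out.extend(range_search(child, q, k))
    return out


words = ["book", "books", "boo", "cook", "cake", "cape", "cart", "boon"]
tree = build(words)
matches = range_search(tree, "bok", 1)
```

The result always equals what a linear scan over `words` would return; the tree only reduces how many distance computations are needed to get there.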

Similar resources

Finding Approximate Matches in Large Lexicons

Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures an...

Approximate String Matching: Theory and Applications (La Recherche Approchée de Motifs : Théorie et Applications)

Approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: let D = {x1, x2, ..., xd} be a set of d words defined on an alphabet Σ, let q be a query also defined on Σ, and let k be a positive integer. We want to build a data structure on D capable of answering the following query: find all words i...

Fast Approximate String Matching in a Dictionary

A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in tex...

Approximate String Matching ? Edgar

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the R occurrences of ...

Approximate string matching algorithms for limited-vocabulary OCR output correction

Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian a...

Publication date: 2004
